Abstract

We propose a new method of classifying 
documents into categories. We define for 
each category a finite mixture model based 
on soft clustering of words. We treat the 
problem of classifying documents as that 
of conducting statistical hypothesis testing 
over finite mixture models, and employ the 
EM algorithm to efficiently estimate pa- 
rameters in a finite mixture model. Exper- 
imental results indicate that our method 
outperforms existing methods. 



1 Introduction 

We are concerned here with the issue of classifying 
documents into categories. More precisely, we begin 
with a number of categories (e.g., 'tennis, soccer, 
skiing'), each already containing certain documents. 
Our goal is to determine into which categories newly 
given documents ought to be assigned, and to do so 
on the basis of the distribution of each document's 
words.¹

Many methods have been proposed to address this issue, and a
number of them have proved to be quite effective (e.g., (Apte,
Damerau, and Weiss, 1994; Cohen and Singer, 1996; Lewis, 1992;
Lewis and Ringuette, 1994; Lewis et al., 1996; Schutze, Hull, and
Pedersen, 1995; Yang and Chute, 1994)).



The simple method of conducting hypothesis testing over
word-based distributions in categories (defined in Section 2) is
not efficient in storage and suffers from the data sparseness
problem, i.e., the number of parameters in the distributions is
large and the data size is not sufficiently large for accurately
estimating them. In order to address this difficulty, (Guthrie,
Walker, and Guthrie, 1994) have proposed using distributions
based on what we refer to as hard clustering of words, i.e., in
which a word is assigned to a single cluster and words in the
same cluster are treated uniformly. The use of hard clustering
might, however, degrade classification results, since the
distributions it employs are not always precise enough for
representing the differences between categories.

We propose here to employ soft clustering,² i.e., a word can be
assigned to several different clusters and each cluster is
characterized by a specific word probability distribution. We
define for each category a finite mixture model, which is a
linear combination of the word probability distributions of the
clusters. We thereby treat the problem of classifying documents
as that of conducting statistical hypothesis testing over finite
mixture models. In order to accomplish hypothesis testing, we
employ the EM algorithm to efficiently and approximately
calculate from training data the maximum likelihood estimates of
parameters in a finite mixture model.

Our method overcomes the major drawbacks of the method using
word-based distributions and the method based on hard clustering,
while retaining their merits; it in fact includes those two
methods as special cases. Experimental results indicate that our
method outperforms them.

Although the finite mixture model has already been used elsewhere
in natural language processing (e.g., (Jelinek and Mercer, 1980;
Pereira, Tishby, and Lee, 1993)), this is the first work, to the
best of our knowledge, that uses it in the context of document
classification.



1 A related issue is the retrieval, from a data base, of
documents (e.g., (Deerwester et al., 1990; Fuhr, 1989; Robertson
and Jones, 1976; Salton and McGill, 1983; Wong and Yao, 1989)).







2 We borrow from (Pereira, Tishby, and Lee, 1993) 
the terms hard clustering and soft clustering, which were 
used there in a different task. 



2 Previous Work 
Word-based method 

A simple approach to document classification is to 
view this problem as that of conducting hypothesis 
testing over word-based distributions. In this paper, 
we refer to this approach as the word-based method 
(hereafter, referred to as WBM). 

Letting $W$ denote a vocabulary (a set of words), and $w$ denote
a random variable representing any word in it, for each category
$c_i$ ($i = 1, \cdots, n$), we define its word-based distribution
$P(w|c_i)$ as a histogram type of distribution over $W$. (The
number of free parameters of such a distribution is thus
$|W| - 1$.) WBM then views a document as a sequence of words:

$$d = w_1, \cdots, w_N \tag{1}$$

and assumes that each word is generated indepen- 
dently according to a probability distribution of a 
category. It then calculates the probability of a doc- 
ument with respect to a category as 



$$P(d|c_i) = P(w_1, \cdots, w_N|c_i) = \prod_{t=1}^{N} P(w_t|c_i) \tag{2}$$



and classifies the document into that category for 
which the calculated probability is the largest. We 
should note here that a document's probability with 
respect to each category is equivalent to the likeli- 
hood of each category with respect to the document, 
and to classify the document into the category for 
which it has the largest probability is equivalent to 
classifying it into the category having the largest 
likelihood with respect to it. Hereafter, we will use
only the term likelihood and denote it as $L(d|c_i)$.

Notice that in practice the parameters in a dis- 
tribution must be estimated from training data. In 
the case of WBM, the number of parameters is large; 
the training data size, however, is usually not suffi- 
ciently large for accurately estimating them. This 
is the data sparseness problem that so often stands 
in the way of reliable statistical language processing 
(e.g., (Gale and Church, 1990)). Moreover, the number
of parameters in word-based distributions is too
large to be efficiently stored.
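
As a minimal sketch of the word-based method, the following Python fragment scores the example document of Fig. 1 against the word frequencies of Tab. 1, using Eq. (2) in logarithmic form. The add-0.5 smoothing is an assumption borrowed from the expected likelihood estimator that footnote 3 gives for HCM (the text does not say how the WBM probabilities are estimated), and the names `FREQ`, `word_log_likelihood`, and `classify_wbm` are illustrative only.

```python
import math

# Word frequencies per category, as in Tab. 1 (blank cells read as 0).
FREQ = {
    "c1": {"racket": 4, "stroke": 1, "shot": 2, "goal": 1, "kick": 0, "ball": 2},
    "c2": {"racket": 0, "stroke": 0, "shot": 0, "goal": 3, "kick": 2, "ball": 2},
}


def word_log_likelihood(doc, counts, alpha=0.5):
    """log2 L(d|c) under Eq. (2), with add-alpha smoothing (an assumption)."""
    total = sum(counts.values()) + alpha * len(counts)
    return sum(math.log2((counts[w] + alpha) / total) for w in doc)


def classify_wbm(doc, freq=FREQ):
    """Return the category with the largest log likelihood."""
    return max(freq, key=lambda c: word_log_likelihood(doc, freq[c]))


doc = ["kick", "goal", "goal", "ball"]
scores = {c: word_log_likelihood(doc, FREQ[c]) for c in FREQ}
best = classify_wbm(doc)
```

Without some form of smoothing, the zero count of 'kick' in $c_1$ would make the likelihood of $c_1$ exactly zero, which is a small-scale instance of the data sparseness problem discussed above.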

Method based on hard clustering 

In order to address the above difficulty, Guthrie et al. have
proposed a method based on hard clustering of words (Guthrie,
Walker, and Guthrie, 1994) (hereafter we will refer to this
method as HCM). Let $c_1, \cdots, c_n$ be categories. HCM first
conducts hard clustering of words. Specifically, it (a) defines a
vocabulary as a set of words $W$ and defines as clusters its
subsets $k_1, \cdots, k_m$ satisfying $\cup_{j=1}^{m} k_j = W$
and $k_i \cap k_j = \emptyset$ ($i \neq j$) (i.e., each word is
assigned only to a single cluster); and (b) treats uniformly all
the words assigned to the same cluster. HCM then defines for each
category $c_i$ a distribution of the clusters $P(k_j|c_i)$
($j = 1, \cdots, m$). It replaces each word $w_t$ in the document
with the cluster $k_t$ to which it belongs ($t = 1, \cdots, N$).
It assumes that a cluster $k_t$ is distributed according to
$P(k_j|c_i)$ and calculates the likelihood of each category $c_i$
with respect to the document by

$$L(d|c_i) = L(k_1, \cdots, k_N|c_i) = \prod_{t=1}^{N} P(k_t|c_i). \tag{3}$$



Table 1: Frequencies of words

      racket  stroke  shot  goal  kick  ball
c1      4       1      2     1     0     2
c2      0       0      0     3     2     2

Table 2: Clusters and words (L = 5, M = 5)

k1   racket, stroke, shot
k2   kick
k3   goal, ball

Table 3: Frequencies of clusters

      k1   k2   k3
c1     7    0    3
c2     0    2    5



There are any number of ways to create clusters in hard
clustering, but the method employed is crucial to the accuracy of
document classification. Guthrie et al. have devised a way
suitable to document classification. Suppose that there are two
categories $c_1$ = 'tennis' and $c_2$ = 'soccer,' and we obtain
from the training data (previously classified documents) the
frequencies of words in each category, such as those in Tab. 1.
Letting $L$ and $M$ be given positive integers, HCM creates three
clusters: $k_1$, $k_2$ and $k_3$, in which $k_1$ contains those
words which are among the $L$ most frequent words in $c_1$, and
not among the $M$ most frequent in $c_2$; $k_2$ contains those
words which are among the $L$ most frequent words in $c_2$, and
not among the $M$ most frequent in $c_1$; and $k_3$ contains all
remaining words (see Tab. 2). HCM then counts the frequencies of
clusters in each category (see Tab. 3) and estimates the
probabilities of clusters being in each category (see Tab. 4).³
Suppose that a newly given document, like $d$ in Fig. 1, is to be
classified. HCM calculates the likelihood values $L(d|c_1)$ and
$L(d|c_2)$ according to Eq. (3). (Tab. 5 shows the logarithms of
the resulting likelihood values.) It then classifies $d$ into
$c_2$, as $\log_2 L(d|c_2)$ is larger than $\log_2 L(d|c_1)$.

Table 4: Probability distributions of clusters

      k1     k2     k3
c1   0.65   0.04   0.30
c2   0.06   0.29   0.65



d = kick, goal, goal, ball
Figure 1: Example document

Table 5: Calculating log likelihood values

$\log_2 L(d|c_1) = 1 \times \log_2 .04 + 3 \times \log_2 .30 = -9.85$
$\log_2 L(d|c_2) = 1 \times \log_2 .29 + 3 \times \log_2 .65 = -3.65$

HCM can handle the data sparseness problem quite well. By
assigning words to clusters, it can drastically reduce the number
of parameters to be estimated. It can also save space for storing
knowledge. We argue, however, that the use of hard clustering
still has the following two problems:

1. HCM cannot assign a word to more than one cluster at a time.
Suppose that there is another category $c_3$ = 'skiing' in which
the word 'ball' does not appear, i.e., 'ball' will be indicative
of both $c_1$ and $c_2$, but not $c_3$. If we could assign 'ball'
to both $k_1$ and $k_2$, the likelihood value for classifying a
document containing that word into $c_1$ or $c_2$ would become
larger, and that for classifying it into $c_3$ would become
smaller. HCM, however, cannot do that.

2. HCM cannot make the best use of information about the
differences among the frequencies of words assigned to an
individual cluster. For example, it treats 'racket' and 'shot'
uniformly because they are assigned to the same cluster $k_1$
(see Tab. 2). 'Racket' may, however, be more indicative of $c_1$
than 'shot,' because it appears more frequently in $c_1$ than
'shot.' HCM fails to utilize this information. This problem will
become more serious when the values $L$ and $M$ in word
clustering are large, which renders the clustering itself
relatively meaningless.

From the perspective of the number of parameters, HCM employs
models having very few parameters, and thus may sometimes fail to
represent much information that is useful for classification.



3 We calculate the probabilities here by using the so-called
expected likelihood estimator (Gale and Church, 1990):

$$P(k_j|c_i) = \frac{f(k_j|c_i) + 0.5}{f(c_i) + 0.5 \times m}, \tag{4}$$

where $f(k_j|c_i)$ is the frequency of the cluster $k_j$ in
$c_i$, $f(c_i)$ is the total frequency of clusters in $c_i$, and
$m$ is the total number of clusters.
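
The hard-clustering computation of this section can be sketched in a few lines of Python: the cluster frequencies of Tab. 3 are smoothed with the estimator of Eq. (4), and the document of Fig. 1 is scored with Eq. (3) in logarithmic form. The word-to-cluster mapping is the one implied by Tabs. 1 and 3, and the names `CLUSTER_FREQ`, `WORD_TO_CLUSTER`, `ele`, and `hcm_log_likelihood` are illustrative assumptions, not part of the original method description.

```python
import math

# Cluster frequencies per category, as in Tab. 3 (blank cells read as 0).
CLUSTER_FREQ = {
    "c1": {"k1": 7, "k2": 0, "k3": 3},
    "c2": {"k1": 0, "k2": 2, "k3": 5},
}
# Hard assignment of words to clusters (consistent with Tabs. 1 and 3).
WORD_TO_CLUSTER = {
    "racket": "k1", "stroke": "k1", "shot": "k1",
    "kick": "k2", "goal": "k3", "ball": "k3",
}


def ele(cluster_counts):
    """Expected likelihood estimator of Eq. (4): (f + 0.5) / (total + 0.5 * m)."""
    m = len(cluster_counts)
    total = sum(cluster_counts.values())
    return {k: (f + 0.5) / (total + 0.5 * m) for k, f in cluster_counts.items()}


def hcm_log_likelihood(doc, category):
    """log2 L(d|c) of Eq. (3): map words to clusters, then sum log probabilities."""
    probs = ele(CLUSTER_FREQ[category])
    return sum(math.log2(probs[WORD_TO_CLUSTER[w]]) for w in doc)


doc = ["kick", "goal", "goal", "ball"]
ll = {c: hcm_log_likelihood(doc, c) for c in CLUSTER_FREQ}
probs_c1 = ele(CLUSTER_FREQ["c1"])
```

With unrounded probabilities the sketch gives about -9.67 for $c_1$ and -3.65 for $c_2$; Tab. 5 reports -9.85 for $c_1$ because it works from the two-decimal values of Tab. 4. The decision (classify $d$ into $c_2$) is the same either way.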



3 Finite Mixture Model 

We propose a method of document classification based on soft
clustering of words. Let $c_1, \cdots, c_n$ be categories. We
first conduct the soft clustering. Specifically, we (a) define a
vocabulary as a set $W$ of words and define as clusters a number
of its subsets $k_1, \cdots, k_m$ satisfying
$\cup_{j=1}^{m} k_j = W$ (notice that
$k_i \cap k_j = \emptyset$ ($i \neq j$) does not necessarily hold
here, i.e., a word can be assigned to several different
clusters); and (b) define for each cluster $k_j$
($j = 1, \cdots, m$) a distribution $Q(w|k_j)$ over its words
($\sum_{w \in k_j} Q(w|k_j) = 1$) and a distribution $P(w|k_j)$
satisfying:

$$P(w|k_j) = \begin{cases} Q(w|k_j); & w \in k_j \\ 0; & w \notin k_j, \end{cases} \tag{5}$$

where $w$ denotes a random variable representing any word in the
vocabulary. We then define for each category $c_i$
($i = 1, \cdots, n$) a distribution of the clusters $P(k_j|c_i)$,
and define for each category a linear combination of $P(w|k_j)$:

$$P(w|c_i) = \sum_{j=1}^{m} P(k_j|c_i) \times P(w|k_j) \tag{6}$$

as the distribution over its words, which is referred to as a
finite mixture model (e.g., (Everitt and Hand, 1981)).

We treat the problem of classifying a document as that of
conducting the likelihood ratio test over finite mixture models.
That is, we view a document as a sequence of words,

$$d = w_1, \cdots, w_N \tag{7}$$

where $w_t$ ($t = 1, \cdots, N$) represents a word. We assume
that each word is independently generated according to an unknown
probability distribution and determine which of the finite
mixture models $P(w|c_i)$ ($i = 1, \cdots, n$) is more likely to
be the probability distribution by observing the sequence of
words. Specifically, we calculate the likelihood value for each
category with respect to the document by:

$$L(d|c_i) = L(w_1, \cdots, w_N|c_i) = \prod_{t=1}^{N} \left( \sum_{j=1}^{m} P(k_j|c_i) \times P(w_t|k_j) \right). \tag{8}$$

We then classify it into the category having the largest
likelihood value with respect to it. Hereafter, we will refer to
this method as FMM.

FMM includes WBM and HCM as its special cases. If we consider the
specific case (1) in which a word is assigned to a single cluster
and $P(w|k_j)$ is given by

$$P(w|k_j) = \begin{cases} \frac{1}{|k_j|}; & w \in k_j \\ 0; & w \notin k_j, \end{cases} \tag{9}$$

where $|k_j|$ denotes the number of elements belonging to $k_j$,
then we will get the same classification result as in HCM. In
such a case, the likelihood value for each category $c_i$
becomes:

$$L(d|c_i) = \prod_{t=1}^{N} \left( P(k_t|c_i) \times P(w_t|k_t) \right) = \prod_{t=1}^{N} P(k_t|c_i) \times \prod_{t=1}^{N} P(w_t|k_t), \tag{10}$$

where $k_t$ is the cluster corresponding to $w_t$. Since the
probability $P(w_t|k_t)$ does not depend on categories, we can
ignore the second term $\prod_{t=1}^{N} P(w_t|k_t)$ in hypothesis
testing, and thus our method essentially becomes equivalent to
HCM (cf. Eq. (3)).

Further, in the specific case (2) in which $m = n$, for each $j$,
$P(w|k_j)$ has $|W|$ parameters: $P(w|k_j) = P(w|c_j)$, and
$P(k_j|c_i)$ is given by

$$P(k_j|c_i) = \begin{cases} 1; & i = j \\ 0; & i \neq j, \end{cases} \tag{11}$$

the likelihood used in hypothesis testing becomes the same as
that in Eq. (2), and thus our method becomes equivalent to WBM.
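
As a short check of case (2), the indicator weights of Eq. (11) can be substituted into the mixture of Eq. (6), which collapses the sum over clusters to a single term. The step below is an added illustration using the symbols of the surrounding equations; on the right-hand side, $P(w|c_j)$ denotes the word-based distribution of Section 2.

```latex
\begin{align*}
P(w|c_i) &= \sum_{j=1}^{n} P(k_j|c_i) \times P(w|k_j)
          = 1 \times P(w|k_i) + \sum_{j \neq i} 0 \times P(w|k_j)
          = P(w|k_i) = P(w|c_i), \\
L(d|c_i) &= \prod_{t=1}^{N} \left( \sum_{j=1}^{n} P(k_j|c_i) \times P(w_t|k_j) \right)
          = \prod_{t=1}^{N} P(w_t|c_i).
\end{align*}
```

The last expression is the WBM likelihood of Eq. (2), so in this case the two methods rank categories identically.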

4 Estimation and Hypothesis Testing

In this section, we describe how to implement our 
method. 

Creating clusters 

There are any number of ways to create clusters on a 
given set of words. As in the case of hard clustering, 



the way that clusters are created is crucial to the 
reliability of document classification. Here we give 
one example approach to cluster creation. 



Table 6: Clusters and words

k1   racket, stroke, shot, ball
k2   kick, goal, ball



We let the number of clusters equal that of categories (i.e.,
$m = n$)⁴ and relate each cluster $k_i$ to one category $c_i$
($i = 1, \cdots, n$). We then assign individual words to those
clusters in whose related categories they most frequently appear.
Letting $\gamma$ ($0 \le \gamma < 1$) be a predetermined
threshold value, if the following inequality holds:

$$\frac{f(w|c_i)}{f(w)} > \gamma, \tag{12}$$

then we assign $w$ to $k_i$, the cluster related to $c_i$, where
$f(w|c_i)$ denotes the frequency of the word $w$ in category
$c_i$, and $f(w)$ denotes the total frequency of $w$. Using the
data in Tab. 1, we create two clusters: $k_1$ and $k_2$, and
relate them to $c_1$ and $c_2$, respectively. For example, when
$\gamma = 0.4$, we assign 'goal' to $k_2$ only, as the relative
frequency of 'goal' in $c_2$ is 0.75 and that in $c_1$ is only
0.25. We ignore in document classification those words which
cannot be assigned to any cluster using this method, because they
are not indicative of any specific category. (For example, when
$\gamma \ge 0.5$ 'ball' will not be assigned into any cluster.)
This helps to make classification efficient and accurate. Tab. 6
shows the results of creating clusters.
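
The thresholding rule of Eq. (12) can be sketched as follows. The sketch applies the rule to the frequencies of Tab. 1 and keys each cluster by its related category; the function name `create_clusters` and the variable names are illustrative assumptions.

```python
# Word frequencies per category, as in Tab. 1 (blank cells read as 0).
FREQ = {
    "c1": {"racket": 4, "stroke": 1, "shot": 2, "goal": 1, "kick": 0, "ball": 2},
    "c2": {"racket": 0, "stroke": 0, "shot": 0, "goal": 3, "kick": 2, "ball": 2},
}


def create_clusters(freq, gamma):
    """Assign word w to the cluster of category c when f(w|c) / f(w) > gamma."""
    clusters = {c: set() for c in freq}
    vocabulary = {w for counts in freq.values() for w in counts}
    for w in vocabulary:
        total = sum(counts.get(w, 0) for counts in freq.values())
        if total == 0:
            continue  # word never observed, so there is nothing to assign
        for c, counts in freq.items():
            if counts.get(w, 0) / total > gamma:
                clusters[c].add(w)
    return clusters


soft = create_clusters(FREQ, gamma=0.4)  # 'ball' is placed in both clusters
hard = create_clusters(FREQ, gamma=0.5)  # 'ball' is dropped: 0.5 > 0.5 is false
```

With `gamma=0.4` the result matches the two clusters described in the text; with `gamma=0.5` every remaining word belongs to exactly one cluster, which is the hard-clustering regime.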

Estimating P(w\kj) 

We then consider the frequency of a word in a cluster. If a word
is assigned only to one cluster, we view its total frequency as
its frequency within that cluster. For example, because 'goal' is
assigned only to $k_2$, we use as its frequency within that
cluster the total count of its occurrence in all categories. If a
word is assigned to several different clusters, we distribute its
total frequency among those clusters in proportion to the
frequency with which the word appears in each of their respective
related categories. For example, because 'ball' is assigned to
both $k_1$ and $k_2$, we distribute its total frequency among the
two clusters in proportion to the frequency with which 'ball'
appears in $c_1$ and $c_2$, respectively. After that, we obtain
the frequencies of words in each cluster as shown in Tab. 7.

4 One can certainly assume that m > n. 



Table 7: Distributed frequencies of words

      racket  stroke  shot  goal  kick  ball
k1      4       1      2     0     0     2
k2      0       0      0     4     2     2



We then estimate the probabilities of words in
each cluster, obtaining the results in Tab. 8.⁵
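
The two steps just described (distributing word frequencies among clusters, then normalizing within each cluster) can be sketched as below. The sketch starts from the frequencies of Tab. 1 and the clusters of Tab. 6; `RELATED`, `distribute_frequencies`, and `word_distributions` are hypothetical names introduced for illustration.

```python
# Word frequencies per category (Tab. 1) and soft clusters (Tab. 6).
FREQ = {
    "c1": {"racket": 4, "stroke": 1, "shot": 2, "goal": 1, "kick": 0, "ball": 2},
    "c2": {"racket": 0, "stroke": 0, "shot": 0, "goal": 3, "kick": 2, "ball": 2},
}
CLUSTERS = {
    "k1": {"racket", "stroke", "shot", "ball"},
    "k2": {"kick", "goal", "ball"},
}
RELATED = {"k1": "c1", "k2": "c2"}  # each cluster and its related category


def distribute_frequencies(freq, clusters, related):
    """Split each word's total frequency among the clusters containing it,
    in proportion to its frequency in each cluster's related category."""
    result = {k: {} for k in clusters}
    for w in set().union(*clusters.values()):
        total = sum(counts.get(w, 0) for counts in freq.values())
        owners = [k for k in clusters if w in clusters[k]]
        # Positive whenever w was assigned via the threshold rule of Eq. (12).
        weight = sum(freq[related[k]].get(w, 0) for k in owners)
        for k in owners:
            result[k][w] = total * freq[related[k]].get(w, 0) / weight
    return result


def word_distributions(cluster_freq):
    """Maximum likelihood estimate P(w|k) = f(w|k) / f(k)."""
    return {
        k: {w: f / sum(fw.values()) for w, f in fw.items()}
        for k, fw in cluster_freq.items()
    }


dist = distribute_frequencies(FREQ, CLUSTERS, RELATED)
p_word = word_distributions(dist)
```

Here 'goal' keeps its full count of 4 in $k_2$ even though one occurrence came from $c_1$, while the 4 occurrences of 'ball' are split evenly because the word is equally frequent in both categories.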



Table 8: Probability distributions of words

      racket  stroke  shot   goal   kick   ball
k1     0.44    0.11   0.22   0      0      0.22
k2     0       0      0      0.50   0.25   0.25

Table 9: Probability distributions of clusters

      k1     k2
c1   0.86   0.14
c2   0.04   0.96



Estimating P(kj\ci) 

Let us next consider the estimation of $P(k_j|c_i)$. There are
two common methods for statistical estimation, the maximum
likelihood estimation method and the Bayes estimation method. In
their implementation for estimating $P(k_j|c_i)$, however, both
of them suffer from computational intractability. The EM
algorithm (Dempster, Laird, and Rubin, 1977) can be used to
efficiently approximate the maximum likelihood estimator of
$P(k_j|c_i)$. We employ here an extended version of the EM
algorithm (Helmbold et al., 1995). (We have also devised, on the
basis of the Markov chain Monte Carlo (MCMC) technique (e.g.,
(Tanner and Wong, 1987; Yamanishi, 1996)),⁶ an algorithm to
efficiently approximate the Bayes estimator of $P(k_j|c_i)$.)

For the sake of notational simplicity, for a fixed $i$, let us
write $P(k_j|c_i)$ as $\theta_j$ and $P(w|k_j)$ as $P_j(w)$.

5 We calculate the probabilities by employing the maximum
likelihood estimator:

$$P(w|k_j) = \frac{f(w|k_j)}{f(k_j)}, \tag{13}$$

where $f(w|k_j)$ is the frequency of $w$ in $k_j$, and $f(k_j)$
is the total frequency of words in $k_j$.

6 We have confirmed in our preliminary experiment 
that MCMC performs slightly better than EM in docu- 
ment classification, but we omit the details here due to 
space limitations. 



Then letting $\theta = (\theta_1, \cdots, \theta_m)$, the finite
mixture model in Eq. (6) may be written as

$$P(w|\theta) = \sum_{j=1}^{m} \theta_j \times P_j(w). \tag{14}$$

For a given training sequence $w_1 \cdots w_N$, the maximum
likelihood estimator of $\theta$ is defined as the value
$\hat{\theta}$ which maximizes the following log likelihood
function

$$L(\theta) = \frac{1}{N} \sum_{t=1}^{N} \log \left( \sum_{j=1}^{m} \theta_j P_j(w_t) \right). \tag{15}$$

The EM algorithm first arbitrarily sets the initial value of
$\theta$, which we denote as $\theta^{(0)}$, and then
successively calculates the values of $\theta$ on the basis of
its most recent values. Let $s$ be a predetermined number. At the
$l$th iteration ($l = 1, \cdots, s$), we calculate
$\theta^{(l)} = (\theta_1^{(l)}, \cdots, \theta_m^{(l)})$ by

$$\theta_j^{(l)} = \theta_j^{(l-1)} \left( \eta \left( \nabla L(\theta^{(l-1)})_j - 1 \right) + 1 \right), \tag{16}$$

where $\eta > 0$ (when $\eta = 1$, Helmbold et al.'s version
simply becomes the standard EM algorithm), and
$\nabla L(\theta)$ denotes

$$\nabla L(\theta) = \left( \frac{\partial L}{\partial \theta_1}, \cdots, \frac{\partial L}{\partial \theta_m} \right). \tag{17}$$

After $s$ iterations, the EM algorithm outputs
$\theta^{(s)} = (\theta_1^{(s)}, \cdots, \theta_m^{(s)})$ as an
approximation of $\hat{\theta}$. It is theoretically guaranteed
that the EM algorithm converges to a local maximum of the given
likelihood (Dempster, Laird, and Rubin, 1977).

For the example in Tab. 1, we obtain the results
as shown in Tab. 9.
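
A minimal sketch of the update in Eq. (16) is given below, assuming a uniform starting point $\theta^{(0)}$ and a fixed number of iterations (neither is specified in the text). The gradient component used is $\partial L / \partial \theta_j = \frac{1}{N} \sum_t P_j(w_t) / P(w_t|\theta)$, which follows from Eq. (15). The function name `em_eta` and the training sequences, which simply repeat each word according to its frequency in Tab. 1, are illustrative.

```python
# Word distributions within clusters, P_j(w), as in Tab. 8 (unrounded).
P = {
    "k1": {"racket": 4 / 9, "stroke": 1 / 9, "shot": 2 / 9, "ball": 2 / 9},
    "k2": {"goal": 0.5, "kick": 0.25, "ball": 0.25},
}


def em_eta(words, p, eta=1.0, iterations=50):
    """Iterate Eq. (16) from a uniform start; eta = 1 gives standard EM."""
    clusters = list(p)
    # Words covered by no cluster are ignored, as described in Section 4.
    words = [w for w in words if any(w in p[k] for k in clusters)]
    theta = {k: 1.0 / len(clusters) for k in clusters}
    for _ in range(iterations):
        grad = {k: 0.0 for k in clusters}
        for w in words:
            mix = sum(theta[k] * p[k].get(w, 0.0) for k in clusters)
            for k in clusters:
                grad[k] += p[k].get(w, 0.0) / (mix * len(words))
        theta = {k: theta[k] * (eta * (grad[k] - 1.0) + 1.0) for k in clusters}
    return theta


# Training sequences built from the word frequencies of Tab. 1.
train_c1 = ["racket"] * 4 + ["stroke"] + ["shot"] * 2 + ["goal"] + ["ball"] * 2
train_c2 = ["goal"] * 3 + ["kick"] * 2 + ["ball"] * 2
theta_c1 = em_eta(train_c1, P)
theta_c2 = em_eta(train_c2, P)
```

The estimates depend on the starting point and on the number of iterations $s$, so this sketch need not reproduce Tab. 9 digit for digit; it does place most of the weight for $c_1$ on $k_1$ and most of the weight for $c_2$ on $k_2$. For $\eta > 1$ an update can become negative, so a practical implementation would need to guard against that.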

Testing 

For the example in Tab. 1, we can calculate according to Eq. (8)
the likelihood values of the two categories with respect to the
document in Fig. 1 (Tab. 10 shows the logarithms of the
likelihood values). We then classify the document into category
$c_2$, as $\log_2 L(d|c_2)$ is larger than $\log_2 L(d|c_1)$.
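
The test step can be reproduced with a short sketch that evaluates Eq. (8) in logarithmic form, using the two-decimal values printed in Tabs. 8 and 9 and the example document of Fig. 1. The dictionary and function names are illustrative assumptions.

```python
import math

# P(w|k_j) as printed in Tab. 8 and P(k_j|c_i) as printed in Tab. 9.
P_WORD = {
    "k1": {"racket": 0.44, "stroke": 0.11, "shot": 0.22, "ball": 0.22},
    "k2": {"goal": 0.50, "kick": 0.25, "ball": 0.25},
}
P_CLUSTER = {
    "c1": {"k1": 0.86, "k2": 0.14},
    "c2": {"k1": 0.04, "k2": 0.96},
}


def fmm_log_likelihood(doc, category):
    """log2 L(d|c_i) of Eq. (8): sum over words of the log of the mixture."""
    weights = P_CLUSTER[category]
    return sum(
        math.log2(sum(weights[k] * P_WORD[k].get(w, 0.0) for k in weights))
        for w in doc
    )


doc = ["kick", "goal", "goal", "ball"]
ll = {c: fmm_log_likelihood(doc, c) for c in P_CLUSTER}
best = max(ll, key=ll.get)
```

Rounded to two decimals, the two scores agree with the values -14.67 and -6.18 reported in Tab. 10, and `best` is `"c2"`. Note that 'ball' contributes through both clusters, which is exactly what hard clustering cannot express.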

5 Advantages of FMM 

For a probabilistic approach to document classifica- 
tion, the most important thing is to determine what 
kind of probability model (distribution) to employ 
as a representation of a category. It must (1) appropriately
represent a category, and (2) have an appropriate degree of
precision in terms of the number of parameters. The quality of
the model selected directly affects classification results.



Table 10: Calculating log likelihood values

$\log_2 L(d|c_1) = \log_2(.14 \times .25) + 2 \times \log_2(.14 \times .50) + \log_2(.86 \times .22 + .14 \times .25) = -14.67$
$\log_2 L(d|c_2) = \log_2(.96 \times .25) + 2 \times \log_2(.96 \times .50) + \log_2(.04 \times .22 + .96 \times .25) = -6.18$



The finite mixture model we propose is particularly well-suited
to the representation of a category. Described in linguistic
terms, a cluster corresponds to a topic and the words assigned to
it are related to that topic. Though documents generally
concentrate on a single topic, they may sometimes refer for a
time to others, and while a document is discussing any one topic,
it will naturally tend to use words strongly related to that
topic. A document in the category of 'tennis' is more likely to
discuss the topic of 'tennis,' i.e., to use words strongly
related to 'tennis,' but it may sometimes briefly shift to the
topic of 'soccer,' i.e., use words strongly related to 'soccer.'
A human can follow the sequence of words in such a document,
associate them with related topics, and use the distributions of
topics to classify the document. Thus the use of the finite
mixture model can be considered as a stochastic implementation of
this process.

Table 11: Num. of parameters

WBM   $O(n \cdot |W|)$
HCM   $O(n \cdot m)$
FMM   $O(|k| + n \cdot m)$

The use of FMM is also appropriate from the viewpoint of number
of parameters. Tab. 11 shows the numbers of parameters in our
method (FMM), HCM, and WBM, where $|W|$ is the size of a
vocabulary, $|k|$ is the sum of the sizes of word clusters (i.e.,
$|k| = \sum_{j=1}^{m} |k_j|$), $n$ is the number of categories,
and $m$ is the number of clusters. The number of parameters in
FMM is much smaller than that in WBM, which depends on $|W|$, a
very large number in practice (notice that $|k|$ is always
smaller than $|W|$ when we employ the clustering method (with
$\gamma \ge 0.5$) described in Section 4). As a result, FMM
requires less data for parameter estimation than WBM and thus can
handle the data sparseness problem quite well. Furthermore, it
can economize on the space necessary for storing knowledge. On
the other hand, the number of parameters in FMM is larger than
that in HCM. It is able to represent the differences between
categories more precisely than HCM, and thus is able to resolve
the two problems, described in Section 2, which plague HCM.

Another advantage of our method may be seen in contrast to the
use of latent semantic analysis (Deerwester et al., 1990) in
document classification and document retrieval. They claim that
their method can solve the following problems:

synonymy problem how to group synonyms, like 
'stroke' and 'shot,' and make each relatively 
strongly indicative of a category even though 
some may individually appear in the category 
only very rarely; 

polysemy problem how to determine that a word 
like 'ball' in a document refers to a 'tennis ball' 
and not a 'soccer ball,' so as to classify the doc- 
ument more accurately; 

dependence problem how to use dependent words, like
'kick' and 'goal,' to make
their combined appearance in a document more
indicative of a category.

As seen in Tab. 6, our method also helps resolve all
of these problems.

6 Preliminary Experimental Results 

In this section, we describe the results of the exper- 
iments we have conducted to compare the perfor- 
mance of our method with that of HCM and others. 

As a first data set, we used a subset of the Reuters newswire
data prepared by Lewis, called Reuters-21578 Distribution 1.0.⁷
We selected nine overlapping categories, i.e., categories in
which a document may belong to several different categories. We
adopted the Lewis Split in the corpus to obtain the training data
and the test data. Tabs. 12 and 13 give the details. We did not
conduct stemming, or use stop words.⁸ We then applied FMM, HCM,
WBM, and a method based on cosine-similarity, which we denote as
COS,⁹ to conduct binary classification. In particular, we learn
the distribution for each category and that for its complement
category from the training data, and then determine whether or
not to classify into each category the documents in the test
data. When applying FMM, we used our proposed method of creating
clusters in Section 4 and set $\gamma$ to be 0, 0.4, 0.5, 0.7,
because these are representative values. For HCM, we classified
words in the same way as in FMM and set $\gamma$ to be 0.5, 0.7,
0.9, 0.95. (Notice that in HCM, $\gamma$ cannot be set less than
0.5.)

7 Reuters-21578 is available at
http://www.research.att.com/lewis.

8 'Stop words' refers to a predetermined list of words
containing those which are considered not useful for document
classification, such as articles and prepositions.

9 In this method, categories and documents to be classified are
viewed as vectors of word frequencies, and the cosine value
between the two vectors reflects similarity (Salton and McGill,
1983).



Table 12: The first data set 



Num. of doc. in training data    707
Num. of doc. in test data        228
Num. of (type of) words          10902
Avg. num. of words per doc.      310.6



Table 13: Categories in the first data set 



wheat, corn, oilseed, sugar, coffee,
soybean, cocoa, rice, cotton



Table 14: The second data set 



Num. of doc. in training data    13625
Num. of doc. in test data        6188
Num. of (type of) words          50301
Avg. num. of words per doc.      181.3



As a second data set, we used the entire Reuters-21578 data with
the Lewis Split. Tab. 14 gives the details. Again, we did not
conduct stemming, or use stop words. We then applied FMM, HCM,
WBM, and COS to conduct binary classification. When applying FMM,
we used our proposed method of creating clusters and set $\gamma$
to be 0, 0.4, 0.5, 0.7. For HCM, we classified words in the same
way as in FMM and set $\gamma$ to be 0.5, 0.7, 0.9, 0.95. We have
not fully completed these experiments, however, and here we only
give the results of classifying into the ten categories having
the greatest numbers of documents in the test data (see Tab. 15).

For both data sets, we evaluated each method in terms of
precision and recall by means of the so-called
micro-averaging.¹⁰

10 In micro-averaging (Lewis and Ringuette, 1994), precision is
defined as the percentage of classified documents in all
categories which are correctly classified. Recall is defined as
the percentage of the total documents in all categories which
are correctly classified.
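
The micro-averaged measures of footnote 10 pool the counts over all categories before dividing, which can be sketched as follows. The per-category counts in the example are hypothetical and serve only to illustrate the arithmetic; they are not taken from the experiments, and `micro_average` is an illustrative name.

```python
def micro_average(per_category):
    """per_category maps a category to (correct, assigned, relevant) counts:
    correct  = documents correctly classified into the category,
    assigned = documents the classifier put into the category,
    relevant = documents that truly belong to the category."""
    correct = sum(c for c, _, _ in per_category.values())
    assigned = sum(a for _, a, _ in per_category.values())
    relevant = sum(r for _, _, r in per_category.values())
    precision = correct / assigned if assigned else 0.0
    recall = correct / relevant if relevant else 0.0
    return precision, recall


# Hypothetical counts, for illustration only.
counts = {"wheat": (8, 10, 16), "corn": (4, 10, 4)}
precision, recall = micro_average(counts)
```

Because the counts are pooled, categories with many documents weigh more heavily than small ones, in contrast to averaging per-category precision and recall.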



Table 15: Tested categories in the second data set 



earn, acq, crude, money-fx, grain,
interest, trade, ship, wheat, corn



When applying WBM, HCM, and FMM, rather than use the standard
likelihood ratio testing, we used the following heuristics. For
simplicity, suppose that there are only two categories $c_1$ and
$c_2$. Letting $\epsilon$ be a given number larger than or equal
to 0, we assign a new document $d$ in the following way:

$$\begin{cases} \frac{1}{N}(\log L(d|c_1) - \log L(d|c_2)) > \epsilon; & d \rightarrow c_1, \\ \frac{1}{N}(\log L(d|c_2) - \log L(d|c_1)) > \epsilon; & d \rightarrow c_2, \\ \text{otherwise}; & \text{unclassify } d, \end{cases} \tag{18}$$

where $N$ is the size of document $d$.¹¹ (One can easily extend
the method to cases with a greater number of categories.) For
COS, we conducted classification in a similar way.
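
The heuristic of Eq. (18) amounts to thresholding the per-word log likelihood difference. A minimal sketch, assuming two categories and a non-empty document, is given below; the function name `decide` and the threshold value 0.5 in the usage lines are hypothetical.

```python
def decide(log_l1, log_l2, n_words, epsilon=0.0):
    """Apply Eq. (18) to two log likelihoods; n_words is the document size N
    (assumed positive, counting only words that were not discarded)."""
    margin = (log_l1 - log_l2) / n_words
    if margin > epsilon:
        return "c1"
    if -margin > epsilon:
        return "c2"
    return None  # the document is left unclassified


# Log likelihoods from Tab. 10 for the four-word document of Fig. 1.
label = decide(-14.67, -6.18, n_words=4, epsilon=0.5)
# Two nearly equal scores fall inside the margin and stay unclassified.
undecided = decide(-5.0, -5.2, n_words=4, epsilon=0.5)
```

Raising $\epsilon$ leaves more documents unclassified, which trades recall for precision; sweeping it is one way to trace out a precision-recall curve.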

Figs. 2 and 3 show precision-recall curves for the first data
set and those for the second data set, respectively. In these
graphs, values given after FMM and HCM represent $\gamma$ in our
clustering method (e.g., FMM0.5, HCM0.5, etc.). We adopted the
break-even point as a single measure for comparison, which is the
point at which precision equals recall; a higher score for the
break-even point indicates better performance. Tab. 16 shows the
break-even point for each method for the first data set and
Tab. 17 shows that for the second data set. For the first data
set, FMM0 attains the highest score at break-even point; for the
second data set, FMM0.5 attains the highest.

Figure 2: Precision-recall curve for the first data set (curves:
COS, WBM, HCM0.5, HCM0.7, HCM0.9, HCM0.95, FMM0, FMM0.4, FMM0.5,
FMM0.7)



11 Notice that words which are discarded in the clustering
process should not be counted in document size.




Figure 3: Precision-recall curve for the second data set

Table 16: Break-even point for the first data set

COS        0.60
WBM        0.62
HCM0.5     0.32
HCM0.7     0.42
HCM0.9     0.54
HCM0.95    0.51
FMM0       0.66
FMM0.4     0.54
FMM0.5     0.52
FMM0.7     0.42

Table 17: Break-even point for the second data set

COS        0.52
WBM        0.62
HCM0.5     0.47
HCM0.7     0.51
HCM0.9     0.55
HCM0.95    0.31
FMM0       0.62
FMM0.4     0.54
FMM0.5     0.67
FMM0.7     0.62

We considered the following questions:

(1) The training data used in the experimentation may be
considered sparse. Will a word-clustering-based method (FMM)
outperform a word-based method (WBM) here?

(2) Is it better to conduct soft clustering (FMM) than to do
hard clustering (HCM)?

(3) With our current method of creating clusters, as the
threshold $\gamma$ approaches 0, FMM behaves much like WBM and it
does not enjoy the effects of clustering at all (the number of
parameters is as large as in WBM). This is because in this case
(a) a word will be assigned into all of the clusters, (b) the
distribution of words in each cluster will approach that in the
corresponding category in WBM, and (c) the likelihood value for
each category will approach that in WBM (recall case (2) in
Section 3). Since creating clusters in an optimal way is
difficult, when clustering does not improve performance we can at
least make FMM perform as well as WBM by choosing $\gamma = 0$.
The question now is "does FMM perform better than WBM when
$\gamma$ is 0?"



In looking into these issues, we found the follow- 
ing: 

(1) When $\gamma \gg 0$, i.e., when we conduct clustering, FMM
does not perform better than WBM for the first data set, but it
performs better than WBM for the second data set.

Evaluating classification results on the basis of each
individual category, we have found that for three of the nine
categories in the first data set, FMM0.5 performs best, and that
in two of the ten categories in the second data set FMM0.5
performs best. These results indicate that clustering sometimes
does improve classification results when we use our current way
of creating clusters. (Fig. 4 shows the best result for each
method for the category 'corn' in the first data set and Fig. 5
that for 'grain' in the second data set.)

(2) When $\gamma \gg 0$, i.e., when we conduct clustering, the
best of FMM almost always outperforms that of HCM.

(3) When $\gamma = 0$, FMM performs better than WBM for the
first data set, and it performs as well as WBM for the second
data set.

In summary, FMM always outperforms HCM; in 
some cases it performs better than WBM; and in 
general it performs at least as well as WBM. 

For both data sets, the best FMM results are supe- 
rior to those of COS throughout. This indicates that 
the probabilistic approach is more suitable than the 
cosine approach for document classification based on 
word distributions. 

Although we have not completed our experiments on the entire
Reuters data set, we found that the results with FMM on the
second data set are almost as good as those obtained by the other
approaches reported in (Lewis and Ringuette, 1994). (The results
are not directly comparable, because (a) the results in (Lewis
and Ringuette, 1994) were obtained from an older version of the
Reuters data; and (b) they used stop words, but we did not.)




Figure 4: Precision-recall curve for category 'corn'

Figure 5: Precision-recall curve for category 'grain' (curves:
COS, WBM, HCM0.7, FMM0.5)



We have also conducted experiments on the Susanne corpus data¹²
and confirmed the effectiveness of our method. We omit an
explanation of this work here due to space limitations.

7 Conclusions 

Let us conclude this paper with the following re- 
marks: 

1. The primary contribution of this research is 
that we have proposed the use of the finite mix- 
ture model in document classification. 

2. Experimental results indicate that our method 
of using the finite mixture model outperforms 
the method based on hard clustering of words. 



12 The Susanne corpus, which has four non-overlapping
categories, is available at ftp://ota.ox.ac.uk.



3. Experimental results also indicate that in some 
cases our method outperforms the word-based 
method when we use our current method of cre- 
ating clusters. 

Our future work will include:

1. comparing the various methods over the entire 
Reuters corpus and over other data bases, 

2. developing better ways of creating clusters. 

Our proposed method is not limited to document 
classification; it can also be applied to other natu- 
ral language processing tasks, like word sense dis- 
ambiguation, in which we can view the context surrounding
an ambiguous target word as a document
and the word-senses to be resolved as categories.